This document includes the code, output, and interpretation of the analysis conducted on a dataset derived from the VERIS Community Database (VCDB), as outlined in the RiskLens Data Science Candidate Task found at: https://github.com/RiskLens/DataScience_Hiring_Task_2019.
Description of Data: The exercise involved accessing a curated dataset, provided by RiskLens, that was derived from the VERIS Community Database (VCDB). The dataset contained incidents; when they occurred (day, month, and year); the entity that caused each incident; the type of action taken; the type of breach (confidentiality, integrity, or availability); the type and variety of assets compromised; the number of records breached; the victim count; and the location of the victims.
Summary of Findings from Exploratory Analysis: The number of incidents reported each year ranged from 56 to 395, with the peak occurring in 2013. Physical attacks were the most common action in 2010, representing 50% of incidents, but declined to only 12.5% of incidents by 2018. Conversely, hacking attacks represented only 4% of incidents in 2010 but increased to 29% by 2018. Furthermore, the largest breach, which contained 50,000,000 records, was the result of a hacking incident in 2016. There were no major changes in the type of actor behind the incidents, with the exception of an atypically large proportion of incidents from internal actors occurring in 2015. Between 2010 and 2018, incidents involving user devices decreased from 40% to 9%, while incidents involving servers and persons increased from 14% to 32% and from 0.5% to 18%, respectively. There was no meaningful change in the proportion of confidentiality breaches from 2010 to 2018 (100% to 98%), but there was an increase in the proportion of integrity incidents (3% to 36%) and a decrease in the proportion of availability incidents (68% to 27%). The proportion of incidents occurring outside the United States increased from 5% in 2010 to 23% in 2018, with a spike to 30% in 2017. There also appeared to be a shift in the seasonality of incidents from 2010 to 2018: the proportion of incidents in November and December decreased from 12% in each month to 2% and 0%, respectively, while the proportion in January increased from 1% to 12% over the same time frame.
Summary of Findings from Modeling Analysis: Assuming a large healthcare employer (1001 employees or larger) and an incident caused by an internal actor, it is estimated with 90% confidence that the number of medical records breached falls between 361 and 14,484, with 6,170 records being the most likely number of medical records breached during a given incident. With regard to total records (inclusive of medical records), it is estimated with 90% confidence that the number of total records breached during an incident falls between 6,436 and 295,335, with 25,454 records being the most likely count. It is possible that there is a trend toward an increasing number of medical records breached per incident over time; however, a single large outlier of 33,000 records makes a firm conclusion difficult, as removing this outlier changes the direction of the effect.
library(reader)
library(knitr)
library(httr)
library(ggplot2)
library(kableExtra)
library(dplyr)
library(mc2d)
library(rcompanion)
#Libraries for bonus word cloud from summary data
library(RXKCD)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(tidyverse)
library(ciTools)
library(MASS)
library(arm)
VCDB_RAW=read.csv("https://raw.githubusercontent.com/RiskLens/DataScience_Hiring_Task_2019/master/data/vcdb_medical_simplified.csv")
#Filter the data to the years 2010-2018 (inclusive of both).
VCDB=subset(VCDB_RAW, timeline.incident.year >= 2010 & timeline.incident.year <= 2018)
#Visual confirmation of successful filter
ggplot(data=VCDB, aes(x=timeline.incident.year)) +geom_bar(fill = "darkred")
#Numerical confirmation of successful filter
min(VCDB$timeline.incident.year)
## [1] 2010
max(VCDB$timeline.incident.year)
## [1] 2018
#a. How many total incidents are in the database for each year?
#Display Graphically Ordered Temporally
ggplot(data=VCDB, aes(x=timeline.incident.year)) +geom_bar(fill = "darkred") +geom_text(stat='count', aes(label=..count..), vjust=-1)
#Print Numerical Table
incident.by.year.table=table(VCDB$timeline.incident.year)
kable(incident.by.year.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Year | Incidents |
|---|---|
| 2010 | 201 |
| 2011 | 162 |
| 2012 | 286 |
| 2013 | 395 |
| 2014 | 310 |
| 2015 | 316 |
| 2016 | 258 |
| 2017 | 197 |
| 2018 | 56 |
#Visually graph actions by years
ggplot(aes(x = timeline.incident.year ), data = VCDB) +
geom_histogram(aes(fill = action ), binwidth=1, colour="grey20", lwd=0.2) +
scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
#Generate raw numbers
action.by.year.table=table(VCDB$timeline.incident.year,VCDB$action)
kable(action.by.year.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | environmental | error | hacking | malware | misuse | physical | social | unknown |
|---|---|---|---|---|---|---|---|---|
| 2010 | 0 | 57 | 9 | 1 | 20 | 101 | 0 | 13 |
| 2011 | 0 | 41 | 15 | 2 | 32 | 61 | 1 | 10 |
| 2012 | 1 | 65 | 17 | 4 | 61 | 122 | 3 | 13 |
| 2013 | 0 | 112 | 28 | 6 | 108 | 131 | 0 | 10 |
| 2014 | 1 | 95 | 23 | 7 | 89 | 89 | 1 | 5 |
| 2015 | 0 | 105 | 29 | 10 | 121 | 47 | 3 | 1 |
| 2016 | 0 | 81 | 57 | 29 | 45 | 38 | 6 | 2 |
| 2017 | 0 | 49 | 32 | 25 | 49 | 28 | 12 | 2 |
| 2018 | 0 | 16 | 16 | 4 | 11 | 7 | 2 | 0 |
#Generate proportions with new dataset
action.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$action),1) #percentages by row
action.by.year.table.prop.data=as.data.frame(action.by.year.table.prop.table)
action.by.year.table.prop.data=action.by.year.table.prop.data %>% rename( year=Var1, action=Var2, proportion=Freq)
#Print Proportions Table
kable(action.by.year.table.prop.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | environmental | error | hacking | malware | misuse | physical | social | unknown |
|---|---|---|---|---|---|---|---|---|
| 2010 | 0.0000000 | 0.2835821 | 0.0447761 | 0.0049751 | 0.0995025 | 0.5024876 | 0.0000000 | 0.0646766 |
| 2011 | 0.0000000 | 0.2530864 | 0.0925926 | 0.0123457 | 0.1975309 | 0.3765432 | 0.0061728 | 0.0617284 |
| 2012 | 0.0034965 | 0.2272727 | 0.0594406 | 0.0139860 | 0.2132867 | 0.4265734 | 0.0104895 | 0.0454545 |
| 2013 | 0.0000000 | 0.2835443 | 0.0708861 | 0.0151899 | 0.2734177 | 0.3316456 | 0.0000000 | 0.0253165 |
| 2014 | 0.0032258 | 0.3064516 | 0.0741935 | 0.0225806 | 0.2870968 | 0.2870968 | 0.0032258 | 0.0161290 |
| 2015 | 0.0000000 | 0.3322785 | 0.0917722 | 0.0316456 | 0.3829114 | 0.1487342 | 0.0094937 | 0.0031646 |
| 2016 | 0.0000000 | 0.3139535 | 0.2209302 | 0.1124031 | 0.1744186 | 0.1472868 | 0.0232558 | 0.0077519 |
| 2017 | 0.0000000 | 0.2487310 | 0.1624365 | 0.1269036 | 0.2487310 | 0.1421320 | 0.0609137 | 0.0101523 |
| 2018 | 0.0000000 | 0.2857143 | 0.2857143 | 0.0714286 | 0.1964286 | 0.1250000 | 0.0357143 | 0.0000000 |
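For readers less familiar with base R's contingency-table helpers, here is a minimal sketch (toy matrix with hypothetical values, not the VCDB data) of what `prop.table(..., 1)` does:

```r
# Toy 2x2 matrix; prop.table with margin = 1 divides each cell by its
# row total, so every row of the result sums to 1 ("percentages by row").
m <- matrix(c(3, 1, 2, 4), nrow = 2)
p <- prop.table(m, 1)
p           # row 1: 0.6 0.4; row 2: 0.2 0.8
rowSums(p)  # both rows sum to 1
```

Passing `2` as the margin would normalize by column instead, which is why `1` is used throughout this analysis to get per-year proportions.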
#Proportion Visualizations
#Proportion bar charts
Action.by.year.table.prop.data.bar.chart=ggplot(action.by.year.table.prop.data, aes(x=year, y=proportion, fill=action))+ geom_bar(stat="identity", position = "fill")
Action.by.year.table.prop.data.bar.chart
#individual line graphs
Action.by.year.prop.line.graph<-ggplot(action.by.year.table.prop.data, aes(x=year, y=proportion, group=action)) +
geom_line(aes(color=action))+
geom_point(aes(color=action))
print(Action.by.year.prop.line.graph)
#Facet by action to pull apart and visualize
Facet.action.by.year.prop.line.graph=Action.by.year.prop.line.graph+facet_wrap(. ~ action,ncol=3)
print(Facet.action.by.year.prop.line.graph)
#Visually graph actor by years
ggplot(aes(x = timeline.incident.year ), data = VCDB) +
geom_histogram(aes(fill = actor ), binwidth=1, colour="grey20", lwd=0.2) +
scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
#Generate raw numbers
actor.by.year.table=table(VCDB$timeline.incident.year,VCDB$actor)
kable(actor.by.year.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | external | internal | partner | unknown |
|---|---|---|---|---|
| 2010 | 113 | 61 | 15 | 12 |
| 2011 | 83 | 63 | 6 | 10 |
| 2012 | 151 | 103 | 19 | 13 |
| 2013 | 162 | 200 | 22 | 11 |
| 2014 | 125 | 166 | 16 | 3 |
| 2015 | 94 | 211 | 8 | 3 |
| 2016 | 146 | 98 | 10 | 4 |
| 2017 | 102 | 84 | 9 | 2 |
| 2018 | 27 | 27 | 1 | 1 |
#Generate proportions with new dataset
actor.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$actor),1) #percentages by row
actor.by.year.table.prop.data=as.data.frame(actor.by.year.table.prop.table)
actor.by.year.table.prop.data=actor.by.year.table.prop.data %>% rename( year=Var1, actor=Var2, proportion=Freq)
#Print Proportions Table
kable(actor.by.year.table.prop.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | external | internal | partner | unknown |
|---|---|---|---|---|
| 2010 | 0.5621891 | 0.3034826 | 0.0746269 | 0.0597015 |
| 2011 | 0.5123457 | 0.3888889 | 0.0370370 | 0.0617284 |
| 2012 | 0.5279720 | 0.3601399 | 0.0664336 | 0.0454545 |
| 2013 | 0.4101266 | 0.5063291 | 0.0556962 | 0.0278481 |
| 2014 | 0.4032258 | 0.5354839 | 0.0516129 | 0.0096774 |
| 2015 | 0.2974684 | 0.6677215 | 0.0253165 | 0.0094937 |
| 2016 | 0.5658915 | 0.3798450 | 0.0387597 | 0.0155039 |
| 2017 | 0.5177665 | 0.4263959 | 0.0456853 | 0.0101523 |
| 2018 | 0.4821429 | 0.4821429 | 0.0178571 | 0.0178571 |
#Proportion Visualizations
#Proportion bar charts
actor.by.year.table.prop.data.bar.chart=ggplot(actor.by.year.table.prop.data, aes(x=year, y=proportion, fill=actor))+ geom_bar(stat="identity", position = "fill")
actor.by.year.table.prop.data.bar.chart
#individual line graphs
actor.by.year.prop.line.graph<-ggplot(actor.by.year.table.prop.data, aes(x=year, y=proportion, group=actor)) +
geom_line(aes(color=actor))+
geom_point(aes(color=actor))
print(actor.by.year.prop.line.graph)
#Facet by actor to pull apart and visualize
Facet.actor.by.year.prop.line.graph=actor.by.year.prop.line.graph+facet_wrap(. ~ actor,ncol=3)
print(Facet.actor.by.year.prop.line.graph)
#Visually graph asset by years
ggplot(aes(x = timeline.incident.year ), data = VCDB) +
geom_histogram(aes(fill = asset ), binwidth=1, colour="grey20", lwd=0.2) +
scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
#Generate raw numbers
asset.by.year.table=table(VCDB$timeline.incident.year,VCDB$asset)
kable(asset.by.year.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | embedded | kiosk/term | media | network | person | server | unknown | user dev |
|---|---|---|---|---|---|---|---|---|
| 2010 | 0 | 0 | 83 | 0 | 1 | 29 | 8 | 80 |
| 2011 | 0 | 0 | 58 | 0 | 4 | 46 | 8 | 46 |
| 2012 | 0 | 1 | 90 | 1 | 7 | 74 | 18 | 95 |
| 2013 | 0 | 0 | 122 | 3 | 9 | 108 | 26 | 127 |
| 2014 | 1 | 0 | 99 | 1 | 15 | 88 | 28 | 78 |
| 2015 | 1 | 0 | 113 | 0 | 19 | 123 | 16 | 44 |
| 2016 | 1 | 1 | 78 | 0 | 27 | 120 | 10 | 21 |
| 2017 | 5 | 1 | 57 | 3 | 28 | 74 | 13 | 16 |
| 2018 | 1 | 1 | 19 | 0 | 10 | 18 | 2 | 5 |
#Generate proportions with new dataset
asset.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$asset),1) #percentages by row
asset.by.year.table.prop.data=as.data.frame(asset.by.year.table.prop.table)
asset.by.year.table.prop.data=asset.by.year.table.prop.data %>% rename( year=Var1, asset=Var2, proportion=Freq)
#Print Proportions Table
kable(asset.by.year.table.prop.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | embedded | kiosk/term | media | network | person | server | unknown | user dev |
|---|---|---|---|---|---|---|---|---|
| 2010 | 0.0000000 | 0.0000000 | 0.4129353 | 0.0000000 | 0.0049751 | 0.1442786 | 0.0398010 | 0.3980100 |
| 2011 | 0.0000000 | 0.0000000 | 0.3580247 | 0.0000000 | 0.0246914 | 0.2839506 | 0.0493827 | 0.2839506 |
| 2012 | 0.0000000 | 0.0034965 | 0.3146853 | 0.0034965 | 0.0244755 | 0.2587413 | 0.0629371 | 0.3321678 |
| 2013 | 0.0000000 | 0.0000000 | 0.3088608 | 0.0075949 | 0.0227848 | 0.2734177 | 0.0658228 | 0.3215190 |
| 2014 | 0.0032258 | 0.0000000 | 0.3193548 | 0.0032258 | 0.0483871 | 0.2838710 | 0.0903226 | 0.2516129 |
| 2015 | 0.0031646 | 0.0000000 | 0.3575949 | 0.0000000 | 0.0601266 | 0.3892405 | 0.0506329 | 0.1392405 |
| 2016 | 0.0038760 | 0.0038760 | 0.3023256 | 0.0000000 | 0.1046512 | 0.4651163 | 0.0387597 | 0.0813953 |
| 2017 | 0.0253807 | 0.0050761 | 0.2893401 | 0.0152284 | 0.1421320 | 0.3756345 | 0.0659898 | 0.0812183 |
| 2018 | 0.0178571 | 0.0178571 | 0.3392857 | 0.0000000 | 0.1785714 | 0.3214286 | 0.0357143 | 0.0892857 |
#Proportion Visualizations
#Proportion bar charts
asset.by.year.table.prop.data.bar.chart=ggplot(asset.by.year.table.prop.data, aes(x=year, y=proportion, fill=asset))+ geom_bar(stat="identity", position = "fill")
asset.by.year.table.prop.data.bar.chart
#individual line graphs
asset.by.year.prop.line.graph<-ggplot(asset.by.year.table.prop.data, aes(x=year, y=proportion, group=asset)) +
geom_line(aes(color=asset))+
geom_point(aes(color=asset))
print(asset.by.year.prop.line.graph)
#Facet by asset to pull apart and visualize
Facet.asset.by.year.prop.line.graph=asset.by.year.prop.line.graph+facet_wrap(. ~ asset,ncol=3)
print(Facet.asset.by.year.prop.line.graph)
#Visually graph attribute.confidentiality by years
ggplot(aes(x = timeline.incident.year ), data = VCDB) +
geom_histogram(aes(fill = attribute.confidentiality ), binwidth=1, colour="grey20", lwd=0.2) +
scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
#Generate raw numbers
attribute.confidentiality.by.year.table=table(VCDB$timeline.incident.year,VCDB$attribute.confidentiality)
kable(attribute.confidentiality.by.year.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | False | True |
|---|---|---|
| 2010 | 0 | 201 |
| 2011 | 0 | 162 |
| 2012 | 1 | 285 |
| 2013 | 7 | 388 |
| 2014 | 4 | 306 |
| 2015 | 2 | 314 |
| 2016 | 6 | 252 |
| 2017 | 0 | 197 |
| 2018 | 1 | 55 |
#Generate proportions with new dataset
attribute.confidentiality.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$attribute.confidentiality),1) #percentages by row
attribute.confidentiality.by.year.table.prop.data=as.data.frame(attribute.confidentiality.by.year.table.prop.table)
attribute.confidentiality.by.year.table.prop.data=attribute.confidentiality.by.year.table.prop.data %>% rename( year=Var1, attribute.confidentiality=Var2, proportion=Freq)
#Print Proportions Table
kable(attribute.confidentiality.by.year.table.prop.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | False | True |
|---|---|---|
| 2010 | 0.0000000 | 1.0000000 |
| 2011 | 0.0000000 | 1.0000000 |
| 2012 | 0.0034965 | 0.9965035 |
| 2013 | 0.0177215 | 0.9822785 |
| 2014 | 0.0129032 | 0.9870968 |
| 2015 | 0.0063291 | 0.9936709 |
| 2016 | 0.0232558 | 0.9767442 |
| 2017 | 0.0000000 | 1.0000000 |
| 2018 | 0.0178571 | 0.9821429 |
#Proportion Visualizations
#Proportion bar charts
attribute.confidentiality.by.year.table.prop.data.bar.chart=ggplot(attribute.confidentiality.by.year.table.prop.data, aes(x=year, y=proportion, fill=attribute.confidentiality))+ geom_bar(stat="identity", position = "fill")
attribute.confidentiality.by.year.table.prop.data.bar.chart
#individual line graphs
attribute.confidentiality.by.year.prop.line.graph<-ggplot(attribute.confidentiality.by.year.table.prop.data, aes(x=year, y=proportion, group=attribute.confidentiality)) +
geom_line(aes(color=attribute.confidentiality))+
geom_point(aes(color=attribute.confidentiality))
print(attribute.confidentiality.by.year.prop.line.graph)
#Facet by attribute.confidentiality to pull apart and visualize
Facet.attribute.confidentiality.by.year.prop.line.graph=attribute.confidentiality.by.year.prop.line.graph+facet_wrap(. ~ attribute.confidentiality,ncol=3)
print(Facet.attribute.confidentiality.by.year.prop.line.graph)
#Visually graph attribute.integrity by years
ggplot(aes(x = timeline.incident.year ), data = VCDB) +
geom_histogram(aes(fill = attribute.integrity ), binwidth=1, colour="grey20", lwd=0.2) +
scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
#Generate raw numbers
attribute.integrity.by.year.table=table(VCDB$timeline.incident.year,VCDB$attribute.integrity)
kable(attribute.integrity.by.year.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | False | True |
|---|---|---|
| 2010 | 195 | 6 |
| 2011 | 151 | 11 |
| 2012 | 269 | 17 |
| 2013 | 367 | 28 |
| 2014 | 282 | 28 |
| 2015 | 277 | 39 |
| 2016 | 192 | 66 |
| 2017 | 138 | 59 |
| 2018 | 36 | 20 |
#Generate proportions with new dataset
attribute.integrity.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$attribute.integrity),1) #percentages by row
attribute.integrity.by.year.table.prop.data=as.data.frame(attribute.integrity.by.year.table.prop.table)
attribute.integrity.by.year.table.prop.data=attribute.integrity.by.year.table.prop.data %>% rename( year=Var1, attribute.integrity=Var2, proportion=Freq)
#Print Proportions Table
kable(attribute.integrity.by.year.table.prop.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | False | True |
|---|---|---|
| 2010 | 0.9701493 | 0.0298507 |
| 2011 | 0.9320988 | 0.0679012 |
| 2012 | 0.9405594 | 0.0594406 |
| 2013 | 0.9291139 | 0.0708861 |
| 2014 | 0.9096774 | 0.0903226 |
| 2015 | 0.8765823 | 0.1234177 |
| 2016 | 0.7441860 | 0.2558140 |
| 2017 | 0.7005076 | 0.2994924 |
| 2018 | 0.6428571 | 0.3571429 |
#Proportion Visualizations
#Proportion bar charts
attribute.integrity.by.year.table.prop.data.bar.chart=ggplot(attribute.integrity.by.year.table.prop.data, aes(x=year, y=proportion, fill=attribute.integrity))+ geom_bar(stat="identity", position = "fill")
attribute.integrity.by.year.table.prop.data.bar.chart
#individual line graphs
attribute.integrity.by.year.prop.line.graph<-ggplot(attribute.integrity.by.year.table.prop.data, aes(x=year, y=proportion, group=attribute.integrity)) +
geom_line(aes(color=attribute.integrity))+
geom_point(aes(color=attribute.integrity))
print(attribute.integrity.by.year.prop.line.graph)
#Facet by attribute.integrity to pull apart and visualize
Facet.attribute.integrity.by.year.prop.line.graph=attribute.integrity.by.year.prop.line.graph+facet_wrap(. ~ attribute.integrity,ncol=3)
print(Facet.attribute.integrity.by.year.prop.line.graph)
#Visually graph attribute.availability by years
ggplot(aes(x = timeline.incident.year ), data = VCDB) +
geom_histogram(aes(fill = attribute.availability ), binwidth=1, colour="grey20", lwd=0.2) +
scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
#Generate raw numbers
attribute.availability.by.year.table=table(VCDB$timeline.incident.year,VCDB$attribute.availability)
kable(attribute.availability.by.year.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | False | True |
|---|---|---|
| 2010 | 65 | 136 |
| 2011 | 83 | 79 |
| 2012 | 130 | 156 |
| 2013 | 217 | 178 |
| 2014 | 183 | 127 |
| 2015 | 237 | 79 |
| 2016 | 162 | 96 |
| 2017 | 132 | 65 |
| 2018 | 41 | 15 |
#Generate proportions with new dataset
attribute.availability.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$attribute.availability),1) #percentages by row
attribute.availability.by.year.table.prop.data=as.data.frame(attribute.availability.by.year.table.prop.table)
attribute.availability.by.year.table.prop.data=attribute.availability.by.year.table.prop.data %>% rename( year=Var1, attribute.availability=Var2, proportion=Freq)
#Print Proportions Table
kable(attribute.availability.by.year.table.prop.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | False | True |
|---|---|---|
| 2010 | 0.3233831 | 0.6766169 |
| 2011 | 0.5123457 | 0.4876543 |
| 2012 | 0.4545455 | 0.5454545 |
| 2013 | 0.5493671 | 0.4506329 |
| 2014 | 0.5903226 | 0.4096774 |
| 2015 | 0.7500000 | 0.2500000 |
| 2016 | 0.6279070 | 0.3720930 |
| 2017 | 0.6700508 | 0.3299492 |
| 2018 | 0.7321429 | 0.2678571 |
#Proportion Visualizations
#Proportion bar charts
attribute.availability.by.year.table.prop.data.bar.chart=ggplot(attribute.availability.by.year.table.prop.data, aes(x=year, y=proportion, fill=attribute.availability))+ geom_bar(stat="identity", position = "fill")
attribute.availability.by.year.table.prop.data.bar.chart
#individual line graphs
attribute.availability.by.year.prop.line.graph<-ggplot(attribute.availability.by.year.table.prop.data, aes(x=year, y=proportion, group=attribute.availability)) +
geom_line(aes(color=attribute.availability))+
geom_point(aes(color=attribute.availability))
print(attribute.availability.by.year.prop.line.graph)
#Facet by attribute.availability to pull apart and visualize
Facet.attribute.availability.by.year.prop.line.graph=attribute.availability.by.year.prop.line.graph+facet_wrap(. ~ attribute.availability,ncol=3)
print(Facet.attribute.availability.by.year.prop.line.graph)
#Create new variable marking inside or outside US
VCDB$US_BOOLEAN=ifelse(VCDB$victim.country=="US", "US", "NON_US")
#Visually graph US vs. non-US victims by year
ggplot(aes(x = timeline.incident.year ), data = VCDB) +
geom_histogram(aes(fill = US_BOOLEAN ), binwidth=1, colour="grey20", lwd=0.2) +
scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
#Generate raw numbers
victim.country.by.year.table=table(VCDB$timeline.incident.year,VCDB$US_BOOLEAN)
print(victim.country.by.year.table)
##
## NON_US US
## 2010 10 191
## 2011 21 141
## 2012 31 255
## 2013 69 326
## 2014 54 256
## 2015 76 240
## 2016 63 195
## 2017 59 138
## 2018 13 43
#Generate proportions with new dataset
victim.country.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$US_BOOLEAN),1) #percentages by row
victim.country.by.year.table.prop.data=as.data.frame(victim.country.by.year.table.prop.table)
victim.country.by.year.table.prop.data=victim.country.by.year.table.prop.data %>% rename(year=Var1, victim.country=Var2, proportion=Freq)
kable(victim.country.by.year.table.prop.data) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| year | victim.country | proportion |
|---|---|---|
| 2010 | NON_US | 0.0497512 |
| 2011 | NON_US | 0.1296296 |
| 2012 | NON_US | 0.1083916 |
| 2013 | NON_US | 0.1746835 |
| 2014 | NON_US | 0.1741935 |
| 2015 | NON_US | 0.2405063 |
| 2016 | NON_US | 0.2441860 |
| 2017 | NON_US | 0.2994924 |
| 2018 | NON_US | 0.2321429 |
| 2010 | US | 0.9502488 |
| 2011 | US | 0.8703704 |
| 2012 | US | 0.8916084 |
| 2013 | US | 0.8253165 |
| 2014 | US | 0.8258065 |
| 2015 | US | 0.7594937 |
| 2016 | US | 0.7558140 |
| 2017 | US | 0.7005076 |
| 2018 | US | 0.7678571 |
#Proportion Visualizations
#Proportion bar charts
victim.country.by.year.table.prop.data.bar.chart=ggplot(victim.country.by.year.table.prop.data, aes(x=year, y=proportion, fill=victim.country)) + geom_bar(stat="identity",
position = "fill")
victim.country.by.year.table.prop.data.bar.chart
#individual line graphs
victim.country.by.year.prop.line.graph<-ggplot(victim.country.by.year.table.prop.data, aes(x=year, y=proportion, group=victim.country)) +
geom_line(aes(color=victim.country))+
geom_point(aes(color=victim.country))
print(victim.country.by.year.prop.line.graph)
#Facet by victim.country to pull apart and visualize
Facet.victim.country.by.year.prop.line.graph=victim.country.by.year.prop.line.graph+facet_wrap(. ~ victim.country,ncol=3)
print(Facet.victim.country.by.year.prop.line.graph)
Let’s assume that you work for a large healthcare employer (1001 employees or larger), and you are scoping a risk scenario where you are worried about an insider threat (actor: internal) compromising the confidentiality of medical records. Assuming that you will have a cybersecurity incident this year, based on the data you have can you come up with a model that will estimate, with 90% confidence, the range of the counts of medical records that will be compromised in such an incident (minimum and maximum)? Within that 90% confidence interval, what is the most likely count of the breached records? You may ignore those employers with unknown employee counts for the sake of time. Bonus points if you integrate the year of breach into your model – are the trends changing?
#Subset data to include only insider threats (actor: internal).
Internal.actor.data=subset(VCDB, actor=="internal" & victim.employee_count %in% c("1001 to 10000","10001 to 25000", "25001 to 50000", "50001 to 100000","over 100000", "large"))
#Visual Check
ggplot(aes(x = timeline.incident.year ), data = Internal.actor.data) +
geom_histogram(binwidth=1, fill = "darkred", colour="grey20", lwd=0.2) +
scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
#Numerical Table Check
internal.actor.table=table(Internal.actor.data$actor,Internal.actor.data$timeline.incident.year)
kable(internal.actor.table) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|
| external | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| internal | 6 | 3 | 9 | 13 | 10 | 23 | 20 | 13 | 4 |
| partner | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| unknown | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
#Confidentiality Records Analysis
#Visualize counts of confidentiality.medical_records by breach
ggplot(aes(x = confidentiality.medical_records ), data = Internal.actor.data) +
geom_histogram(binwidth=100, fill = "darkred", colour="grey20", lwd=0.2)
#Numerical Table Check
internal.actor.table.confidentiality.medical.records=table(Internal.actor.data$confidentiality.medical_records)
kable(internal.actor.table.confidentiality.medical.records) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Records Breached | Frequency |
|---|---|
| 0 | 48 |
| 1 | 9 |
| 2 | 3 |
| 3 | 1 |
| 8 | 1 |
| 9 | 1 |
| 14 | 2 |
| 74 | 1 |
| 104 | 1 |
| 111 | 1 |
| 112 | 1 |
| 122 | 1 |
| 164 | 1 |
| 197 | 1 |
| 430 | 1 |
| 568 | 1 |
| 600 | 1 |
| 652 | 1 |
| 677 | 1 |
| 827 | 1 |
| 1100 | 1 |
| 1126 | 1 |
| 1136 | 1 |
| 1215 | 1 |
| 1290 | 1 |
| 1300 | 1 |
| 1740 | 1 |
| 1920 | 1 |
| 2027 | 1 |
| 2264 | 1 |
| 2284 | 1 |
| 2777 | 1 |
| 2928 | 1 |
| 3200 | 1 |
| 3247 | 1 |
| 3598 | 1 |
| 5400 | 1 |
| 6000 | 1 |
| 6800 | 1 |
| 7445 | 1 |
| 7500 | 1 |
| 8020 | 1 |
| 33000 | 1 |
#obtain min, max, mode values for PERT distribution analysis
min.pert=min(Internal.actor.data$confidentiality.medical_records)
max.pert=max(Internal.actor.data$confidentiality.medical_records)
#create mode function and then calculate mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
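As a quick sanity check, the body of `getmode` can be traced on a small hypothetical vector:

```r
# Trace of the getmode() logic on a toy vector
v <- c(0, 0, 1, 2, 0, 5)
uniqv <- unique(v)                   # 0 1 2 5
counts <- tabulate(match(v, uniqv))  # 3 1 1 1 (occurrences of each unique value)
uniqv[which.max(counts)]             # 0, the most frequent value
```

Note that on continuous data where every value is unique, this frequency-based approach simply returns whichever value appears first, so it is most meaningful on discrete data such as the raw record counts here.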
mode.pert=getmode(Internal.actor.data$confidentiality.medical_records)
#set number of runs for analysis
num.run.pert=10000
#Generate PERT Statistics
pert.confidentiality.medical_records=rpert(num.run.pert,min.pert,mode.pert,max.pert, shape=4)
pert.data=data.frame(pert.confidentiality.medical_records)
#Plot Distribution of Pert Statistics
gg <- ggplot(pert.data, aes(x = pert.confidentiality.medical_records))
gg <- gg + geom_histogram(aes(y = ..density..),
color="black",
fill = "white",
binwidth = 100)
gg <- gg + geom_density(fill = "darkred", alpha = 1/3)
gg <- gg + theme_bw()
gg
#Generate Summary Statistics for Answer
mode.pert.result=getmode(pert.data$pert.confidentiality.medical_records)
confidence.interval.pert=quantile(pert.data$pert.confidentiality.medical_records, c(0.05,0.95))
print(mode.pert.result)
## [1] 7870.234
print(confidence.interval.pert)
## 5% 95%
## 336.1289 14937.7834
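One caveat on the reported mode: applied to 10,000 continuous `rpert` draws, `getmode` finds every value unique, so `which.max(tabulate(...))` effectively returns an arbitrary early draw. A more stable alternative is to take the peak of a kernel density estimate; the sketch below uses hypothetical PERT parameters (not the fitted values above) purely to illustrate the approach:

```r
library(mc2d)  # for rpert(), loaded at the top of this document
set.seed(42)   # hypothetical seed, for reproducibility of the sketch only
sim <- rpert(10000, min = 0, mode = 6000, max = 33000, shape = 4)
d <- density(sim)               # kernel density estimate of the draws
d$x[which.max(d$y)]             # x-position of the density peak: a mode estimate
```

Substituting `min.pert`, `mode.pert`, and `max.pert` into this sketch would give a density-based "most likely" value for the simulated medical-record counts.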
#Examine number of cases of confidentiality medical records over time and generate 90% confidence intervals
#Utilize Poisson Distribution Due to Count Data
confidentiality.medical_records.over.time=glm(confidentiality.medical_records~timeline.incident.year,data=Internal.actor.data,
family=poisson)
summary(confidentiality.medical_records.over.time)
##
## Call:
## glm(formula = confidentiality.medical_records ~ timeline.incident.year,
## family = poisson, data = Internal.actor.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -50.16 -48.02 -45.71 -15.07 393.06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -67.915067 2.989442 -22.72 <2e-16 ***
## timeline.incident.year 0.037191 0.001484 25.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 412889 on 100 degrees of freedom
## Residual deviance: 412251 on 99 degrees of freedom
## AIC: 412624
##
## Number of Fisher Scoring iterations: 7
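Because the Poisson GLM uses a log link, the year coefficient acts multiplicatively on the expected count. A quick back-of-the-envelope interpretation of the estimate above:

```r
# exp() of a Poisson regression coefficient gives the rate ratio per
# one-unit increase in the predictor (here, per year).
beta_year <- 0.037191   # year coefficient from the model summary above
exp(beta_year)          # ~1.038: expected records grow ~3.8% per year
```

Applying the same transform to the no-outlier model fit below (exp(-0.173368) ≈ 0.84, i.e., a decline per year) shows why the single 33,000-record outlier changes the direction of the estimated trend.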
df_ints <- Internal.actor.data%>%
add_ci(confidentiality.medical_records.over.time, names = c("lcb", "ucb"), alpha = 0.1) %>%
add_pi(confidentiality.medical_records.over.time, names = c("lpb", "upb"), alpha = 0.1, nSims = 20000) %>%
print()
## Warning in add_pi.glm(., confidentiality.medical_records.over.time, names
## = c("lpb", : The response is not continuous, so Prediction Intervals are
## approximate
## # A tibble: 101 x 28
## incident_id timeline.incide… timeline.incide… timeline.incide…
## * <fct> <dbl> <dbl> <fct>
## 1 B124687D-1… 27 8 ""
## 2 89A98591-D… NA NA ""
## 3 1c0cdc80-e… 1 3 ""
## 4 ec902e80-d… NA NA ""
## 5 9ac51270-d… 25 10 ""
## 6 FD47F388-A… NA NA ""
## 7 85232D3A-6… NA 1 ""
## 8 6AA12732-9… NA NA ""
## 9 E2BFCF0E-5… 6 11 ""
## 10 8d8c0670-6… NA NA ""
## # ... with 91 more rows, and 24 more variables:
## # timeline.incident.year <int>, actor <fct>, action <fct>,
## # attribute.confidentiality <fct>, attribute.integrity <fct>,
## # attribute.availability <fct>, asset <fct>, asset.variety <fct>,
## # confidentiality.medical_records <dbl>,
## # confidentiality.payment_records <dbl>,
## # confidentiality.personal_records <dbl>,
## # confidentiality.total_record_count <dbl>, victim.employee_count <fct>,
## # victim.state <fct>, victim.country <fct>, victim.victim_id <fct>,
## # summary <fct>, reference <fct>, US_BOOLEAN <chr>, pred <dbl>,
## # lcb <dbl>, ucb <dbl>, lpb <int>, upb <int>
df_ints %>%
ggplot(aes(x = timeline.incident.year, y = confidentiality.medical_records)) +
geom_point(size = 2) +
ggtitle("Poisson Regression", subtitle = "Model fit (black line), with prediction intervals (gray), confidence intervals (dark gray)") +
geom_line(aes(x = timeline.incident.year, y = pred), size = 1.2) +
geom_ribbon(aes(ymin = lcb, ymax = ucb), alpha = 0.4) +
geom_ribbon(aes(ymin = lpb, ymax = upb), alpha = 0.2)
#Outlier spotted; re-run the analysis with the outlier removed
#Subset to breaches of <10,000 records, a threshold chosen by visual inspection
Internal.actor.data.no.outlier=subset(Internal.actor.data, confidentiality.medical_records<10000)
confidentiality.medical_records.over.time.no.outlier=glm(confidentiality.medical_records~timeline.incident.year,data=Internal.actor.data.no.outlier,
family=poisson)
summary(confidentiality.medical_records.over.time.no.outlier)
##
## Call:
## glm(formula = confidentiality.medical_records ~ timeline.incident.year,
## family = poisson, data = Internal.actor.data.no.outlier)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -56.843 -36.851 -33.682 -3.745 165.792
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 355.856371 3.274037 108.7 <2e-16 ***
## timeline.incident.year -0.173368 0.001626 -106.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 242532 on 99 degrees of freedom
## Residual deviance: 231526 on 98 degrees of freedom
## AIC: 231886
##
## Number of Fisher Scoring iterations: 7
df_ints <- Internal.actor.data.no.outlier %>%
add_ci(confidentiality.medical_records.over.time.no.outlier, names = c("lcb", "ucb"), alpha = 0.1) %>%
add_pi(confidentiality.medical_records.over.time.no.outlier, names = c("lpb", "upb"), alpha = 0.1, nSims = 20000) %>%
print()
## Warning in add_pi.glm(., confidentiality.medical_records.over.time, names
## = c("lpb", : The response is not continuous, so Prediction Intervals are
## approximate
## # A tibble: 100 x 28
## incident_id timeline.incide… timeline.incide… timeline.incide…
## * <fct> <dbl> <dbl> <fct>
## 1 B124687D-1… 27 8 ""
## 2 89A98591-D… NA NA ""
## 3 1c0cdc80-e… 1 3 ""
## 4 ec902e80-d… NA NA ""
## 5 9ac51270-d… 25 10 ""
## 6 FD47F388-A… NA NA ""
## 7 85232D3A-6… NA 1 ""
## 8 6AA12732-9… NA NA ""
## 9 E2BFCF0E-5… 6 11 ""
## 10 8d8c0670-6… NA NA ""
## # ... with 90 more rows, and 24 more variables:
## # timeline.incident.year <int>, actor <fct>, action <fct>,
## # attribute.confidentiality <fct>, attribute.integrity <fct>,
## # attribute.availability <fct>, asset <fct>, asset.variety <fct>,
## # confidentiality.medical_records <dbl>,
## # confidentiality.payment_records <dbl>,
## # confidentiality.personal_records <dbl>,
## # confidentiality.total_record_count <dbl>, victim.employee_count <fct>,
## # victim.state <fct>, victim.country <fct>, victim.victim_id <fct>,
## # summary <fct>, reference <fct>, US_BOOLEAN <chr>, pred <dbl>,
## # lcb <dbl>, ucb <dbl>, lpb <int>, upb <int>
df_ints %>%
ggplot(aes(x = timeline.incident.year, y = confidentiality.medical_records)) +
geom_point(size = 2) +
ggtitle("Poisson Regression", subtitle = "Model fit (black line), with prediction intervals (gray), confidence intervals (dark gray)") +
geom_line(aes(x = timeline.incident.year, y = pred), size = 1.2) +
geom_ribbon(aes(ymin = lcb, ymax = ucb), alpha = 0.4) +
geom_ribbon(aes(ymin = lpb, ymax = upb), alpha = 0.2)
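The prediction bands from `add_pi()` are simulation-based: at each predictor value, responses are drawn from a Poisson distribution at the fitted mean and the interval is read off the empirical quantiles (hence the "Prediction Intervals are approximate" warning for a discrete response). A minimal Python/numpy sketch of that idea, where `mu_hat` is an assumed fitted mean for illustration, not a value taken from this model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed fitted Poisson mean at one value of timeline.incident.year
mu_hat = 30.0

# Simulate responses from the fitted distribution, as add_pi() does internally
sims = rng.poisson(mu_hat, size=20000)  # nSims = 20000, matching the call above

# alpha = 0.1 -> central 90% prediction interval from empirical quantiles
lpb, upb = np.quantile(sims, [0.05, 0.95])
```

Because the Poisson response is integer-valued, the quantiles land on integers and the nominal 90% coverage is only approximate, which is exactly what the warning reports.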
#Total Records Analysis
#Visualize counts of confidentiality.total_record_count by breach
ggplot(aes(x = confidentiality.total_record_count), data = Internal.actor.data) +
geom_histogram(aes(fill = actor), binwidth = 100, colour = "darkred", lwd = 0.2)
#Numerical Table Check
internal.actor.table.confidentiality.total_record_count=table(Internal.actor.data$confidentiality.total_record_count)
kable(internal.actor.table.confidentiality.total_record_count) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Var1 | Freq |
|---|---|
| 0 | 30 |
| 1 | 10 |
| 2 | 5 |
| 3 | 1 |
| 8 | 1 |
| 9 | 1 |
| 14 | 2 |
| 31 | 1 |
| 44 | 1 |
| 48 | 1 |
| 50 | 1 |
| 74 | 1 |
| 104 | 1 |
| 111 | 1 |
| 112 | 1 |
| 122 | 1 |
| 164 | 1 |
| 197 | 1 |
| 250 | 1 |
| 430 | 1 |
| 520 | 1 |
| 568 | 1 |
| 600 | 1 |
| 652 | 1 |
| 677 | 1 |
| 827 | 1 |
| 848 | 1 |
| 900 | 1 |
| 1062 | 1 |
| 1100 | 1 |
| 1124 | 1 |
| 1126 | 1 |
| 1136 | 1 |
| 1215 | 1 |
| 1290 | 1 |
| 1300 | 1 |
| 1740 | 1 |
| 1920 | 1 |
| 2027 | 1 |
| 2264 | 1 |
| 2284 | 1 |
| 2777 | 1 |
| 2928 | 1 |
| 3200 | 1 |
| 3247 | 1 |
| 3403 | 1 |
| 3598 | 1 |
| 5300 | 1 |
| 5400 | 1 |
| 6000 | 1 |
| 6800 | 1 |
| 7445 | 1 |
| 7500 | 1 |
| 8020 | 1 |
| 12061 | 1 |
| 33000 | 1 |
| 108000 | 1 |
| 655550 | 1 |
#obtain min, max, mode values for PERT distribution analysis
min.pert.total=min(Internal.actor.data$confidentiality.total_record_count)
max.pert.total=max(Internal.actor.data$confidentiality.total_record_count)
#create mode function and then calculate mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
mode.pert.total=getmode(Internal.actor.data$confidentiality.total_record_count)
#set number of runs for analysis
num.run.pert.total=10000
#Generate PERT Statistics
pert.confidentiality.total_record_count=rpert(num.run.pert.total,min.pert.total,mode.pert.total,max.pert.total, shape=4)
pert.data.total=data.frame(pert.confidentiality.total_record_count)
#Plot Distribution of PERT Statistics
gg.total <- ggplot(pert.data.total, aes(x = pert.confidentiality.total_record_count))
gg.total <- gg.total + geom_histogram(aes(y = ..density..),
color="black",
fill = "white",
binwidth = 100)
gg.total <- gg.total + geom_density(fill = "darkred", alpha = 1/3)
gg.total <- gg.total + theme_bw()
gg.total
#Generate Summary Statistics for Answer
mode.pert.result.total=getmode(pert.data.total$pert.confidentiality.total_record_count)
confidence.interval.pert.total=quantile(pert.data.total$pert.confidentiality.total_record_count, c(0.05,0.95))
print(mode.pert.result.total)
## [1] 163679.2
print(confidence.interval.pert.total)
## 5% 95%
## 6317.488 299338.743
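`rpert()` (from the mc2d package, assuming the standard beta-PERT parameterization with shape 4) maps the min/mode/max triple onto a scaled Beta distribution. A hedged Python sketch of that mapping, using illustrative parameters rather than the record counts above:

```python
import numpy as np

def rpert(n, mn, mode, mx, shape=4.0, rng=None):
    """Sample a beta-PERT distribution: min/mode/max are mapped to
    Beta(a, b) shape parameters, then draws are rescaled to [mn, mx]."""
    rng = rng or np.random.default_rng(0)
    a = 1 + shape * (mode - mn) / (mx - mn)
    b = 1 + shape * (mx - mode) / (mx - mn)
    return mn + (mx - mn) * rng.beta(a, b, size=n)

# Illustrative parameters (not the record-count min/mode/max from the data)
draws = rpert(100_000, mn=0, mode=50, mx=100)
lo, hi = np.quantile(draws, [0.05, 0.95])  # 90% interval, as in the analysis
```

Under this parameterization the mean is (min + 4·mode + max) / 6, so the distribution concentrates around the mode while still spanning the full min-to-max range.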
#Generate WordCloud from Text (This section takes a while to process)
#Convert summary text vector into a corpus for the word cloud and generate a table of frequencies
ap.corpus <- Corpus(VectorSource(VCDB$summary))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(ap.corpus, removePunctuation):
## transformation drops documents
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(ap.corpus, content_transformer(tolower)):
## transformation drops documents
ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))
## Warning in tm_map.SimpleCorpus(ap.corpus, function(x) removeWords(x,
## stopwords("english"))): transformation drops documents
ap.corpus <- Corpus(VectorSource(ap.corpus))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 6211 1923 1016 639 482 330 213 198 176 133 119 116 84 87 70
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 74 58 49 51 44 37 27 33 25 30 29 26 20 35 27
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 16 21 26 15 16 18 22 17 21 14 8 10 7 7 5
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 8 8 13 12 5 11 9 10 13 8 8 8 7 2 8
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 9 10 3 7 5 7 11 2 4 5 6 1 4 8 8
## 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 5 6 2 3 2 5 6 3 4 7 2 4 3 2 4
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
## 5 6 3 3 2 1 3 2 4 5 3 1 2 2 2
## 106 107 109 110 111 112 113 114 115 116 119 120 122 123 126
## 1 4 2 2 1 2 2 2 2 2 4 3 1 1 1
## 127 128 130 131 132 134 135 137 138 139 140 141 144 145 148
## 2 1 1 2 1 2 1 1 1 2 2 3 1 1 3
## 151 152 153 154 155 156 160 162 163 164 165 167 168 169 171
## 2 1 1 1 1 1 2 1 1 1 3 1 1 1 1
## 172 175 176 178 181 188 191 196 202 205 207 209 212 213 214
## 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1
## 218 222 226 228 230 237 239 244 249 250 260 265 267 268 269
## 2 1 2 2 3 1 1 1 1 1 1 1 1 1 1
## 276 281 286 307 309 321 329 338 340 345 348 361 364 365 366
## 2 1 1 1 1 1 1 1 1 1 3 1 1 1 1
## 369 371 372 374 392 394 401 409 410 415 437 443 456 471 504
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 516 527 644 703 836 837 846 920 1021 1154 1319 1424 1488 1773 2696
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
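The tm preprocessing above (strip punctuation, lowercase, drop English stopwords, count terms) can be sketched outside R. A minimal Python equivalent, using a tiny stand-in stopword set rather than tm's full English list, and hypothetical incident summaries rather than rows from VCDB:

```python
import re
from collections import Counter

# Tiny stand-in for tm's English stopword list (an assumption, not the full list)
STOPWORDS = {"the", "a", "an", "of", "was", "by", "to", "and", "were"}

def term_frequencies(docs):
    """Mirror the tm steps: strip punctuation, lowercase, drop stopwords, count."""
    counts = Counter()
    for doc in docs:
        words = re.sub(r"[^\w\s]", " ", doc).lower().split()
        counts.update(w for w in words if w not in STOPWORDS)
    return counts

# Hypothetical incident summaries for illustration
freqs = term_frequencies([
    "A laptop was stolen by an employee.",
    "Patient records were stolen.",
])
```

The resulting counts play the role of `ap.d` above: terms sorted by frequency feed directly into the word cloud.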
pal2 <- brewer.pal(8,"Dark2")
#Generate figure as PNG
png("wordcloud_packages.png", width=1280,height=800)
wordcloud(ap.d$word,ap.d$freq, scale=c(8,.2),min.freq=3,
max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()
## quartz_off_screen
## 2